In this project, I explore the quality of red wine. The dataset contains 1,599 red wine records, each with 11 variables describing the chemical properties of the wine. At least three wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
Through these variables, I’d like to uncover which chemical properties influence the quality of red wines and which variable has the greatest effect. Perhaps the results can help me choose a red wine the next time I am at the wine shop.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## 7 7 7.9 0.60 0.06 1.6 0.069
## 8 8 7.3 0.65 0.00 1.2 0.065
## 9 9 7.8 0.58 0.02 2.0 0.073
## 10 10 7.5 0.50 0.36 6.1 0.071
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## 7 15 59 0.9964 3.30 0.46 9.4
## 8 15 21 0.9946 3.39 0.47 10.0
## 9 9 18 0.9968 3.36 0.57 9.5
## 10 17 102 0.9978 3.35 0.80 10.5
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
## 7 5
## 8 7
## 9 7
## 10 5
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
For the structure of the dataset, the data frame has 13 columns: an index column ‘X’, 11 chemical properties that serve as the independent variables, and ‘quality’, which is the dependent variable. Also, there are no missing values in these variables, and the dataset itself is pretty tidy.
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (tartaric acid - g / dm^3)
2 - volatile acidity: the amount of acetic acid in wine (acetic acid - g / dm^3)
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines (g / dm^3)
4 - residual sugar: the amount of sugar remaining after fermentation stops (g / dm^3)
5 - chlorides: the amount of salt in the wine (sodium chloride - g / dm^3)
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion (mg / dm^3)
7 - total sulfur dioxide: amount of free and bound forms of SO2 (mg / dm^3)
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content (g / cm^3)
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels (potassium sulphate - g / dm^3)
11 - alcohol: the percent alcohol content of the wine (% by volume)
12 - quality (score between 0 and 10)
A quick look at the dataset reveals a few small issues to fix before the EDA. Even though the dataset is pretty tidy, fixing these issues makes it easier to conduct further exploration.
Make the variable names consistent: rename ‘citric.acid’ to ‘citric.acidity’.
Standardize the units of the chemical properties: convert ‘total sulfur dioxide’ and ‘free sulfur dioxide’ from mg / dm^3 to g / dm^3.
# 1. rename col
colnames(wine)[colnames(wine)=="citric.acid"] <- "citric.acidity"
# 2. change measurement
wine$total.sulfur.dioxide <- wine$total.sulfur.dioxide / 1000
wine$free.sulfur.dioxide <- wine$free.sulfur.dioxide / 1000
# 3. check the result
tail(wine)
## X fixed.acidity volatile.acidity citric.acidity residual.sugar
## 1594 1594 6.8 0.620 0.08 1.9
## 1595 1595 6.2 0.600 0.08 2.0
## 1596 1596 5.9 0.550 0.10 2.2
## 1597 1597 6.3 0.510 0.13 2.3
## 1598 1598 5.9 0.645 0.12 2.0
## 1599 1599 6.0 0.310 0.47 3.6
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1594 0.068 0.028 0.038 0.99651 3.42
## 1595 0.090 0.032 0.044 0.99490 3.45
## 1596 0.062 0.039 0.051 0.99512 3.52
## 1597 0.076 0.029 0.040 0.99574 3.42
## 1598 0.075 0.032 0.044 0.99547 3.57
## 1599 0.067 0.018 0.042 0.99549 3.39
## sulphates alcohol quality
## 1594 0.82 9.5 6
## 1595 0.58 10.5 5
## 1596 0.76 11.2 6
## 1597 0.75 11.0 6
## 1598 0.71 10.2 5
## 1599 0.66 11.0 6
From the structure of the dataset, we can see that all the chemical properties are numeric, while ‘quality’ is an integer. It’s fine to keep quality as an integer rather than converting it to a factor, since I create a new variable called ‘quality.bucket’ with three levels of wine quality (low/medium/high) based on the actual distribution of quality (from 3 to 8).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
## Low Medium High
## 63 1319 217
The quality shows a roughly bell-shaped distribution, since the majority of red wines are rated 5 or 6. That means the sample is unbalanced in quality; we have few exceptional or poor quality wines. It’s also surprising that no wines are rated 0-2 or 9-10, even though the rating scale runs from 0 to 10. That could imply that some very bad or very good wines were already filtered out of this dataset, which may hurt the validity of the sample.
The new variable ‘quality.bucket’, with its three levels of quality, shows the same pattern: the medium quality wines (1,319 of 1,599) dominate the quality data.
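The report does not show how ‘quality.bucket’ is built. A minimal sketch, assuming cut points of 4 and 6 (so that ratings up to 4 are Low, 5-6 are Medium, and 7 and above are High) and using only the ten quality scores printed at the top of this report, might look like this:

```r
# The ten quality scores shown in the head() output above
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points (not stated in the report): (0,4] Low, (4,6] Medium, (6,10] High
quality.bucket <- cut(quality,
                      breaks = c(0, 4, 6, 10),
                      labels = c("Low", "Medium", "High"))

bucket_counts <- table(quality.bucket)
print(bucket_counts)
```

Because `cut()` uses right-closed intervals by default, a rating of exactly 4 falls in Low and a rating of exactly 6 falls in Medium.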
For each independent variable, I check the statistical summary, plot the histogram, and apply a transformation where necessary. I also make boxplots to show the outliers and better understand the distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
This histogram of density shows an approximately normal distribution with a mean of 0.9967 and a range of 0.9901 to 1.0037. The red line (mean) and green line (median) nearly coincide, which also indicates a symmetric, normal-like shape. I also change the binwidth for better visualization.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
This histogram of pH shows an approximately normal distribution with a mean of 3.311 and a range of 2.740 to 4.010. The red line (mean) and green line (median) nearly coincide, which also indicates a symmetric, normal-like shape. I also change the binwidth for better visualization.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
This histogram of alcohol shows a slightly positive (right) skew. The alcohol content ranges from 8.40 to 14.90, with a mean of 10.42 and a median of 10.20. The boxplot also shows that a few values around 14 are outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution of residual.sugar is right-skewed with a long right tail. To improve that, I apply a log base 10 transformation and get a relatively better distribution. The boxplot also confirms that many outliers lie between 4 and 16, well above the mean (2.539) and the 3rd quartile (2.600).
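To see why a log base 10 transformation shortens the right tail, here is a small sketch that uses only the five summary values reported above for residual.sugar. The helper `tail_ratio` is a hypothetical name introduced for illustration; it compares the distance from the median to the maximum with the distance from the median to the minimum.

```r
# Summary values for residual.sugar, taken from the summary() output above
s <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Hypothetical helper: how much longer is the right tail than the left tail?
tail_ratio <- function(v) {
  unname((v["max"] - v["median"]) / (v["median"] - v["min"]))
}

raw_ratio <- tail_ratio(s)         # on the original scale
log_ratio <- tail_ratio(log10(s))  # after the log10 transformation

print(c(raw = raw_ratio, log10 = log_ratio))
```

On the original scale the right tail is roughly ten times longer than the left; after the transformation the two are much closer, which is what the transformed histogram reflects.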
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Since the distribution of chlorides is also right-skewed, I again apply a log base 10 transformation and get a relatively better distribution. The boxplot also confirms that there are many outliers (red dots) greater than the mean (0.08747) and the 3rd quartile (0.09000).
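The boxplot outliers can be cross-checked from the summary alone. The sketch below assumes the boxplots use R's default 1.5 × IQR rule (the default for both base `boxplot()` and ggplot2's `geom_boxplot()`; the report does not say which was used) and takes the quartiles from the chlorides summary above.

```r
# Quartiles and extremes for chlorides, from the summary() output above
q1  <- 0.070
q3  <- 0.090
mn  <- 0.012
mx  <- 0.611

iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr  # points above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr  # points below this are drawn as outliers

print(c(lower = lower_fence, upper = upper_fence))
print(c(has_high_outliers = mx > upper_fence,
        has_low_outliers  = mn < lower_fence))
```

With these quartiles the fences come out at about 0.04 and 0.12, so the maximum of 0.611 lies far beyond the upper fence, and the minimum of 0.012 also falls below the lower one.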
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00600 0.02200 0.03800 0.04647 0.06200 0.28900
The total.sulfur.dioxide has a positively skewed distribution between 0.00600 and 0.28900 with a mean of 0.04647. I again apply a log base 10 transformation and get a relatively better distribution. The boxplot shows a few outliers (red dots) between 0.1 and 0.2. It is worth noting that there are two extreme outliers between 0.25 and 0.30.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The distribution of sulphates is right-skewed with a long right tail. To improve that, I apply a log base 10 transformation and get a relatively better distribution. The boxplot also confirms that there are many outliers, reaching up to the maximum of 2.0, which is far greater than the mean (0.6581) and the 3rd quartile (0.7300).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The distribution of citric.acidity is multimodal, ranging from 0 to 1 with a mean of 0.271. After applying a sqrt scale transformation, it’s easy to see three peaks around 0.00, 0.25, and 0.50. The boxplot indicates a few outliers greater than 0.75, while many points accumulate at 0 and around 0.50.
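One practical reason to prefer sqrt over log base 10 for this variable is that it contains exact zeros. A minimal sketch, using only the ten citric acid values printed at the top of this report:

```r
# The ten citric acid values shown in the head() output above
citric <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36)

log_vals  <- log10(citric)  # zeros become -Inf
sqrt_vals <- sqrt(citric)   # zeros stay at 0

n_lost_on_log  <- sum(!is.finite(log_vals))
n_lost_on_sqrt <- sum(!is.finite(sqrt_vals))

print(c(log10 = n_lost_on_log, sqrt = n_lost_on_sqrt))
```

Non-finite values are dropped from a plot on a log scale, whereas the sqrt scale keeps every observation.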
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00100 0.00700 0.01400 0.01587 0.02100 0.07200
Free.sulfur.dioxide has a right-skewed distribution between 0.00100 and 0.07200 with a mean of 0.01587. To improve that, I apply a log base 10 transformation and get a relatively better distribution. In the transformed plot, I can see two peaks around 0.0006 and 0.0026. The boxplot also confirms that a few outliers lie between 0.04 and 0.07, far greater than the mean (0.01587) and the 3rd quartile (0.02100).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
This histogram of fixed.acidity shows a slightly positive (right) skew, so I only change the binwidth for better visualization. The fixed.acidity ranges from 4.60 to 15.90, with a mean of 8.32 and a median of 7.90. The boxplot also shows that a few values at the upper end are outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
This histogram of volatile.acidity shows a nearly normal distribution with a mean of 0.5278 and a range of 0.1200 to 1.5800. I also change the binwidth for better visualization. The red line (mean) and green line (median) are close together, which also indicates a nearly normal shape, while the boxplot reflects a few outliers at the upper end.
From the plots and summaries above, we can see that the distributions of the variables vary. Many variables are over-dispersed with very long tails, so I transform these values to shorten the tails. Depending on the distribution, I transform the variables (‘residual.sugar’, ‘chlorides’, ‘total.sulfur.dioxide’, ‘sulphates’) by taking the log base 10 and the variables (‘citric.acidity’, ‘free.sulfur.dioxide’) by taking the sqrt.
I also apply the transformation to the variables (‘fixed.acidity’, ‘volatile.acidity’), but there is no significant change in the distribution, so I keep the original scale for these two variables. Besides, there is no need to change the distributions of three variables: ‘density’, ‘pH’, and ‘alcohol’. In short, I trim the long tails of most variables by taking the log base 10 or the sqrt, adjust the binwidth for better visualization, and finally check the outliers through boxplots.
For the structure of the dataset, the data frame has 13 columns: an index column ‘X’, 11 chemical properties that serve as the independent variables, and ‘quality’, which is the dependent variable. Also, there are no missing values in these variables, and the dataset itself is pretty tidy.
The ‘quality’ is the dependent variable and the one I am most interested in. I am also interested in alcohol, residual.sugar, density, and pH, as those might be the critical factors in red wine quality ratings.
Other features that may support the investigation are citric.acidity and chlorides.
I create a new variable called ‘quality.bucket’ with three levels of wine quality (low/medium/high) based on the actual distribution of quality (from 3 to 8).
Many variables are over-dispersed with very long tails, so I transform these values to shorten the tails. Depending on the distribution, I transform the variables (‘residual.sugar’, ‘chlorides’, ‘total.sulfur.dioxide’, ‘sulphates’) by taking the log base 10 and the variables (‘citric.acidity’, ‘free.sulfur.dioxide’) by taking the sqrt.
My main interest is to find the features that correlate most with quality. I am also very interested in some particular factors, such as alcohol, residual.sugar, and density, because I expect these factors to affect the quality of red wines based on my own experience. Thus, I first draw a correlation plot to check all the bivariate relationships and test my hypotheses.
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
##
## Pearson's product-moment correlation
##
## data: quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
##
## Pearson's product-moment correlation
##
## data: quality and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
##
## Pearson's product-moment correlation
##
## data: quality and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
The bivariate correlation matrix indicates that the features with the highest Pearson correlation with ‘quality’ are alcohol (r = 0.48), volatile acidity (r = -0.39), and sulphates (r = 0.25). The results confirm the positive correlation between alcohol and quality, the strongest of any feature, while rejecting the guess that residual.sugar is also correlated with quality. The correlation between quality and residual.sugar is r = 0.014 (p = 0.58), which is too weak to support any conclusion.
Other features also have quite strong relationships with each other, for example density vs. fixed.acidity (r = 0.67), fixed.acidity vs. citric.acidity (r = 0.67), fixed.acidity vs. pH (r = -0.68), and total.sulfur.dioxide vs. free.sulfur.dioxide (r = 0.67). However, those strong relationships are perhaps due to shared chemical properties or elements.
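The `cor` value in the test output is Pearson's r, not r squared. As a consistency check, the sketch below recomputes the t statistic and the 95% confidence interval for quality vs. alcohol from the reported `cor` and `df` alone, using the standard formulas (the t statistic for a correlation and the Fisher z interval), and also shows the corresponding r squared.

```r
# Values copied from the cor.test() output for quality and alcohol
r  <- 0.4761663
df <- 1597
n  <- df + 2   # cor.test reports df = n - 2

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

# Share of variance in quality associated with alcohol
r_squared <- r^2

print(round(c(t = t_stat, lower = ci[1], upper = ci[2], r2 = r_squared), 4))
```

The recomputed t statistic and interval agree with the printed output, and r squared comes out near 0.23, so alcohol alone accounts for a little under a quarter of the variance in quality.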
##
## Pearson's product-moment correlation
##
## data: quality and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
The plot above shows the positive relationship between alcohol and quality in more detail. We have learned that alcohol positively correlates with quality (r = 0.48), but the relationship is not that straightforward. The first three boxplots show that alcohol percentage differs little among the lower-rated wines: there is not much difference in alcohol for wines rated 3-5. From rating 6 upward, the higher the alcohol content, the higher the rating.
I also add markers for the mean values and jittered points to the plots, which confirm that the majority of the sample is concentrated in the medium quality wines (ratings 5 and 6). At the same time, the rating 5 wines have a few outliers with high alcohol content. Therefore, alcohol positively correlates with quality, and high quality wines in particular have a high alcohol content.
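The mean markers in the plot correspond to a per-rating average of alcohol. A minimal sketch of that calculation, using only the ten rows printed at the top of this report (far too few to reproduce the pattern in the full data, so this illustrates the mechanics only):

```r
# quality and alcohol for the ten rows shown in the head() output above
sample_wine <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# Mean alcohol within each quality rating
mean_alcohol <- tapply(sample_wine$alcohol, sample_wine$quality, mean)

print(round(mean_alcohol, 2))
```

On the full dataset, the same `tapply()` call on `wine$alcohol` and `wine$quality` would give the six group means that the markers display.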
##
## Pearson's product-moment correlation
##
## data: quality and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The previous bivariate correlation matrix indicates a weak positive relationship between sulphates and quality (r = 0.25). The plot above confirms this result: as the sulphates content increases, the quality slowly improves.
## [1] -0.3905578
The previous bivariate correlation matrix indicates a moderate negative relationship between volatile.acidity and quality (r = -0.39). The plot above verifies this result: as the volatile.acidity content decreases, the quality increases.
##
## Pearson's product-moment correlation
##
## data: residual.sugar and density
## t = 15.189, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3116908 0.3973835
## sample estimates:
## cor
## 0.3552834
Even though residual.sugar seems to have no relationship with my main focus, quality, I found another interesting point in the plot above. The higher the residual.sugar, the higher the density of the red wine, with a moderate positive correlation (r = 0.36). A likely explanation is physical: as noted in the variable descriptions, the density of wine depends on its sugar and alcohol content, and dissolved sugar makes the liquid denser.
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
Since alcohol positively correlates with quality, I expected alcohol to be positively related to density as well. Surprisingly, the plot above shows a moderate negative relationship between alcohol and density (r = -0.50), which means that the less alcohol a wine contains, the higher its density. In hindsight this makes sense, since alcohol is less dense than water.
Alcohol percentage (r = 0.48) has the strongest positive relationship with quality, while two other features, volatile acidity (r = -0.39) and sulphates (r = 0.25), are also correlated with the quality of the red wine.
Among the other features, two relationships stand out. First, the higher the residual.sugar, the higher the density of the red wine, with a moderate positive correlation (r = 0.36). Second, alcohol is negatively correlated with density (r = -0.50), which means that the less alcohol a wine contains, the higher its density.
The strongest relationship I found is the negative relationship between fixed.acidity and pH (r = -0.68), although it is not related to my main focus.
The previous bivariate analysis suggests that alcohol is negatively correlated with density (r = -0.50) and positively associated with quality (r = 0.48). Thus, I combine those three important features in one plot for a multivariate analysis.
From the plot above, it seems that, in general, better quality wines have a higher alcohol percentage as well as a lower density. The dark blue dots (high quality) sit in the middle and upper areas while the light blue dots sit in the bottom area. I also add a regression line for each quality level. The regression lines all show the negative relationship between alcohol and density.
## Warning: Removed 156 rows containing non-finite values (stat_smooth).
## Warning: Removed 156 rows containing missing values (geom_point).
Similarly, the previous analysis shows that density is positively correlated with residual.sugar (r = 0.36) but slightly negatively correlated with quality (r = -0.17). In this multivariate plot, the regression lines indicate the positive relationship between density and residual.sugar as well as the slight negative relationship between density and quality (slightly more dark blue dots stay in the lower area). However, it’s hard to find a correlation between residual.sugar and quality; the quality-colored dots are distributed evenly as the amount of residual.sugar changes.
I found that better quality wines have a higher alcohol percentage as well as a lower density. Besides, density is also positively correlated with residual.sugar.
One interesting point is that alcohol has a moderately strong positive relationship with quality (r = 0.48) but at the same time a moderately strong negative relationship with density (r = -0.50). As a result, quality is also negatively correlated with density, though not as strongly.
The first plot I chose gives a full picture of the correlations between the variables. The colors and numbers provide an intuitive way to take in all the bivariate correlations at once, which provides ideas and clues for further data exploration.
The second plot I chose shows the positive relationship between alcohol and quality in more detail. The first three boxplots show that alcohol percentage differs little among the lower-rated wines: there is not much difference in alcohol for wines rated 3-5. It seems that alcohol positively correlates with quality, and high quality wines in particular have a high alcohol content.
The third plot I chose shows the strongest and most meaningful relationship in my analysis: high-quality wines come with a higher alcohol percentage as well as a lower density.
With this red wine dataset, I set out to explore which chemical properties influence the quality of red wines and which variable has the greatest effect. After the EDA, I found that high-quality wines tend to have a higher alcohol percentage; in other words, alcohol is a useful indicator for estimating the quality of red wine, especially for differentiating high quality wines from low/medium ones. Furthermore, better quality wines also have a lower density.
Besides, other features also have quite strong relationships with each other, such as density vs. fixed.acidity, fixed.acidity vs. citric.acidity, fixed.acidity vs. pH, and total.sulfur.dioxide vs. free.sulfur.dioxide. Those strong relationships are perhaps due to shared chemical properties or elements, which I do not fully understand, and they are outside my focus on quality.
I can see a positive relationship between alcohol and the quality of red wine, among others, but that does not imply causation between the two variables. Only a controlled experiment could establish causation.
This red wine dataset has only 1,599 records, which limits how strong a claim I can make. Regarding the most important variable, quality, there are many more normal (medium) wines than excellent or poor ones. This unbalanced distribution of wine quality could lead to poor results if we use the raw data to build a predictive model.
In this dataset, most of the variables are physicochemical data. There is no data about grape types, wine brand, wine price, or region/location, all of which are critical information about a red wine. I am also very interested in how quality affects the price of wine, but there is no variable for the selling price. Besides, the quality rating is based on the opinions of at least three wine experts, and I doubt the objectivity of this rating system.
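To put a number on the imbalance, the sketch below converts the ‘quality.bucket’ counts reported earlier (Low 63, Medium 1319, High 217) into percentage shares.

```r
# Bucket counts from the quality.bucket table earlier in this report
counts <- c(Low = 63, Medium = 1319, High = 217)

# Percentage share of each bucket, rounded to one decimal place
shares <- round(100 * counts / sum(counts), 1)

print(shares)
```

The Medium bucket holds about 82.5% of the wines, so a model that always predicts Medium would already be right about that often. Any predictive model built on the raw data would need to beat that baseline to be useful.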
WineQualityInfo: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt